Abstract. Many real world data are sampled functions. As shown by
Functional Data Analysis (FDA) methods, spectra, time series, images,
gesture recognition data, etc. can be processed more efficiently if
their functional nature is taken into account during the data analysis
process. This is done by extending standard data analysis methods so
that they can apply to functional inputs. A general way to achieve this
goal is to compute projections of the functional data onto a finite
dimensional sub-space of the functional space. The coordinates of the
data on a basis of this sub-space provide standard vector
representations of the functions. The obtained vectors can be processed
by any standard method.

In [43], this general approach has been used to define projection based
Multilayer Perceptrons (MLPs) with functional inputs. We study in this
paper important theoretical properties of the proposed model. We show in
particular that MLPs with functional inputs are universal approximators:
they can approximate to arbitrary accuracy any continuous mapping from a
compact sub-space of a functional space to $\mathbb{R}$. Moreover, we
provide a consistency result that shows that any mapping from a
functional space to $\mathbb{R}$ can be learned thanks to examples by a
projection based MLP: the generalization mean square error of the MLP
decreases to the smallest possible mean square error on the data when
the number of examples goes to infinity.

Keywords: Functional Data Analysis; Multilayer Perceptron; Universal
Approximation; Consistency; Projection

1. Introduction

In many practical situations, input data are in fact sampled functions 
rather than standard high dimensional vectors. This is the case for 
instance in spectrometry: a discretized spectrum is obtained by mea- 
suring the transmittance or the reflectance of an object at different 
wavelengths. Modern spectrometers can produce very high resolution
spectra, with a thousand observations for each spectrum.

Another general example of sampled functions is given by time
series. Indeed, a time series is a mapping from a time period to an
observation range, for instance the hourly temperature at a weather
station over one month. More complex examples can be found in mete- 
orology, for instance rainfall maps, i.e., functions that map geographical 
coordinates and date to the daily rain level observed at the specified 
position and date. 

Functional Data Analysis (FDA) [4, 37] is a general methodology 
targeted at data that are better described as functions than as vectors. 
The main idea is to take advantage of the functional nature of the 
data to design better data analysis methods than the ones constructed 
thanks to a vector model. For a comprehensive introduction to FDA 
methods we refer the reader to [36] in which extensions of classical data 
analysis tools to functional data, developed since pioneering works such 
as [17] and [14], are precisely described. 

The simplest case of FDA corresponds to a situation in which all 
considered functions are discretized at the same points. More precisely, 
if we consider n functions $g^1, \ldots, g^n$ and m sampling points
$x_1, \ldots, x_m$, we obtain n vectors from $\mathbb{R}^m$,
$(g^i(x_1), \ldots, g^i(x_m))$. While direct comparison between vectors
remains possible, this type of data suffers from two drawbacks: high
dimension vectors and high correlation between variables. [36] focuses
on this situation and provides solutions that explicitly use the
underlying functions $(g^i)_{1 \le i \le n}$. The general methodology
uses the fact that most multivariate data analysis methods are based on
scalar products and/or distance calculations which can be easily
translated from a finite dimensional space to a functional space. A
simple example is given by linear regression: if we want to predict a
target variable $y^i$ in $\mathbb{R}$ with a linear model on $g^i$, the
classical model on the discretized function tries to model $y^i$ as:

$$y^i = w_0 + \sum_{j=1}^{m} w_j\, g^i(x_j) + \epsilon^i \qquad (1)$$

A functional version is given by [37, 23, 7]: 

$$y^i = w_0 + \int w(x)\, g^i(x)\, dx + \epsilon^i, \qquad (2)$$

in which most numerical parameters of the model ($w_1, \ldots, w_m$)
have been replaced by a unique functional parameter. A simple yet
powerful idea to implement the functional version of the model is to
estimate each $g^i$ thanks to the corresponding vector
$(g^i(x_1), \ldots, g^i(x_m))$ and then to work on the approximated
function. Classical solutions are based on spline approximations of both
$g^i$ and $w$ (see [23, 33, 8] for instance). They solve the variable
correlation problem by reducing the effective dimension of the
functional parameter thanks to regularity assumptions (e.g. bounded
second derivative).
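
As a minimal sketch of the link between models (1) and (2), the integral
of the functional model can be approximated by a quadrature rule on the
sampling grid, which amounts to a discretized model whose weights are
tied to the values of $w$ at the sampling points. The function names,
the uniform grid on [0, 1], the trapezoidal rule and the particular
choices of $w$ and $g$ below are illustrative assumptions; they are not
taken from the cited references.

```python
import math

def trapezoid(values, xs):
    """Trapezoidal approximation of an integral from sampled values."""
    total = 0.0
    for j in range(len(xs) - 1):
        total += 0.5 * (values[j] + values[j + 1]) * (xs[j + 1] - xs[j])
    return total

def functional_linear_model(w, g_samples, xs, w0=0.0):
    """Approximate w0 + integral of w(x) g(x) dx from the samples g(x_j)."""
    products = [w(x) * gx for x, gx in zip(xs, g_samples)]
    return w0 + trapezoid(products, xs)

# Hypothetical example: g(x) = sin(pi x) sampled on m points, w(x) = x.
m = 200
xs = [j / (m - 1) for j in range(m)]
g_samples = [math.sin(math.pi * x) for x in xs]
prediction = functional_linear_model(lambda x: x, g_samples, xs, w0=1.0)

# Closed form for this toy case: 1 + integral_0^1 x sin(pi x) dx = 1 + 1/pi.
exact = 1.0 + 1.0 / math.pi
print(prediction, exact)
```

The quadrature only needs the sampled values $g(x_j)$, so the same
function applies to irregular grids as long as the sampling points are
sorted.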

A very interesting side effect of estimating $g^i$ thanks to its
discretized version is to allow processing of irregularly sampled
functions that are quite common in many applications, especially medical
ones (see e.g. [6, 24, 27, 38]). Group smoothing techniques have been
developed for these types of data: rather than estimating each $g^i$
independently, one can try to optimize the global representation of all
examples, either by EM like methods [27, 38] or using hybrid splines and
cross-validation [3]. Moreover, functional transformations (such as
derivative calculation, see [19, 18]) can be performed on the
representation. The functional view of high dimensional data therefore
offers many more possibilities than bare multivariate analysis.

Many classical data analysis tools have been adapted to functional 
data. Principal Component Analysis was the first method studied in a 
functional framework by [17] and [14] (see also [15, 36] and [27]). Other 
linear methods have been studied more recently, such as Canonical 
Correlation Analysis [30], linear discriminant analysis [22, 26] and linear 
regression (as presented above [37, 23, 7]). Non linear models such as 
generalized linear models [25], slice inverse regression [21] and non- 
parametric kernel based estimation [19, 18] have also been reformulated 
to work on functional data. Unsupervised classification of functions 
has also been studied, as a quantization problem in [32] and more 
traditionally with k-means like approaches [1] or mixture models [28]. 
Additional references and discussions about functional data analysis 
can be found in [36]. 

Neural models have been recently adapted to functional data (see 
[41, 43, 42, 16, 39]). Building on extensions of multilayer perceptrons 
(MLPs) to arbitrary inputs studied in [44, 45, 46, 47], we have proposed 
in [41] a functional multilayer perceptron (FMLP) based on approxi- 
mate calculation of some integrals. While this model has interesting 
theoretical properties (cf [41, 40]) and gives very satisfactory results 
on real world benchmarks, it suffers from the need of a specialized 
implementation and from long training times. In [43], we have pro- 
posed another functional MLP based on projection operators. This 
method has some advantages over the one studied in [41], especially 
because the projections can be implemented as a pre-processing step 
that transforms functions into adapted vector representations. The vec- 
tors obtained like this are then processed by a standard neural model. 
We have shown in [43] that this functional model performs very well 
on real world data. However, this illustration was only experimental. 

In this paper, we study theoretically the capabilities of projection
based functional MLPs. We first recall in section 2 the definition of
the functional multilayer perceptron and its projection based
implementation. In section 3, we show that any continuous function from
a compact sub-space of a functional space to $\mathbb{R}$ can be
approximated arbitrarily well by projection based FMLPs, which are
therefore universal approximators. In section 4, we show that functional
MLPs can learn arbitrary mappings from a functional space to
$\mathbb{R}$. More precisely, we show that the asymptotic generalization
error of functional MLPs converges to the minimum possible error,
provided the training is done properly. Proofs are gathered in section 6.



2. Multilayer perceptrons with functional inputs 

2.1. Introduction 

In this section, we recall the definition of functional multilayer
perceptrons given in [43]. We focus on regular functions. More
precisely, we denote $\mu$ a $\sigma$-finite positive Borel measure
defined on $\mathbb{R}^n$ and $L^2(\mu)$ the space of measurable real
valued functions^1 defined on $\mathbb{R}^n$ and such that
$\int f^2 \, d\mu < \infty$. $L^2(\mu)$ is a Hilbert space equipped with
its natural inner product $\langle f, g \rangle = \int f g \, d\mu$ (we
denote $\|f\|_2 = \sqrt{\langle f, f \rangle}$).

To avoid cumbersome notations, this paper is restricted to data 
described by a single function valued variable. However, the results can 
be easily extended to the case of data described by several functional 
variables. We also restrict ourselves to one real valued output, but 
results are also valid for vector valued output. 

2.2. Theoretical model 

As recalled in the introduction and explained in [36], many data
analysis methods are based on the Hilbert structure of the input space
rather than on its finite dimension. Using this idea, [43] defines
multilayer perceptrons with functional inputs, as recalled here.

A multilayer perceptron (MLP) consists of neurons that perform very
simple calculations. Given an input $x \in \mathbb{R}^p$, the output of
a neuron is

$$T\left(\beta_0 + \sum_{i=1}^{p} \beta_i x_i\right), \qquad (3)$$

where $x_i$ is the i-th coordinate of $x$, $T$ is an activation function
from $\mathbb{R}$ to $\mathbb{R}$, and $\beta_0, \ldots, \beta_p$ are
numerical parameters (the weights of the neuron).

^1 More precisely, $L^2(\mu)$ contains equivalence classes of functions
that differ only on a $\mu$-negligible set.

The sum $\sum_{i=1}^{p} \beta_i x_i$ is in fact the inner product in
$\mathbb{R}^p$ between $x$ and $(\beta_1, \ldots, \beta_p)$. As proposed
in [41, 43], a functional neuron can be defined thanks to the inner
product in $L^2(\mu)$. Given an input $g \in L^2(\mu)$, the output of a
functional neuron is

$$T(\beta_0 + \langle w, g \rangle) = T\left(\beta_0 + \int w g \, d\mu\right), \qquad (4)$$

where $w$ is a function from $L^2(\mu)$, the "weight function". This
functional neuron is in fact a special case of neurons with arbitrary
input spaces defined in previous theoretical works [44, 45, 46, 47].

As the output of a generalized neuron is a numerical value, we need
such neurons only in the first layer of the MLP. Indeed, the second
layer uses only outputs from the first layer, which are real numbers,
and therefore consists of numerical neurons. For example, a single
hidden layer perceptron with a unique output neuron maps a functional
input $g$ to

$$H(g) = \sum_{l=1}^{L} a_l \, T\left(\beta_{l0} + \int w_l \, g \, d\mu\right), \qquad (5)$$

where $L$ denotes the number of hidden (functional) neurons and
$a_1, \ldots, a_L$ are real valued connection weights of the output
neuron (it has a linear activation function).
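
The single hidden layer model of equation (5) can be sketched
numerically as follows. This is only an illustration under stated
assumptions: $\mu$ is taken to be the Lebesgue measure on [0, 1], the
integrals are approximated with the trapezoidal rule on a fine grid, $T$
is the logistic function, and the names `inner_product`,
`functional_mlp` and `hidden` are hypothetical.

```python
import math

def inner_product(w, g, xs):
    """Trapezoidal approximation of <w, g> = integral of w g on the grid xs."""
    total = 0.0
    for j in range(len(xs) - 1):
        left = w(xs[j]) * g(xs[j])
        right = w(xs[j + 1]) * g(xs[j + 1])
        total += 0.5 * (left + right) * (xs[j + 1] - xs[j])
    return total

def logistic(t):
    """Assumed activation function T."""
    return 1.0 / (1.0 + math.exp(-t))

def functional_mlp(g, hidden, xs, activation=logistic):
    """Evaluate H(g) = sum_l a_l T(beta_l0 + <w_l, g>), as in equation (5).

    `hidden` is a list of (a_l, beta_l0, w_l) triples, w_l being a callable.
    """
    return sum(a * activation(b0 + inner_product(w, g, xs))
               for a, b0, w in hidden)

# Two hidden functional neurons with weight functions w_1(x) = 1, w_2(x) = x.
xs = [j / 500 for j in range(501)]
hidden = [(2.0, 0.0, lambda x: 1.0), (-1.0, 0.5, lambda x: x)]
output = functional_mlp(lambda x: x, hidden, xs)
print(output)
```

For the input $g(x) = x$ the two inner products are 1/2 and 1/3, so the
output can be checked by hand against
$2\,T(1/2) - T(1/2 + 1/3)$.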



2.3. Projection 

While the model presented in the previous section is a simple gener- 
alization of its numerical counterpart, it cannot be used in practice, 
as only a limited class of functions can be easily manipulated on a 
computer. Those functions are obtained as combinations (sum, prod- 
uct, composition, etc.) of elementary functions: polynomial functions, 
trigonometric functions, etc. 

In order to solve this problem, FDA methods rely in general on
projections. Let us indeed consider a finite p-dimensional subspace of
$L^2(\mu)$, denoted $V_p$. The main principle of projection based FDA
methods is to constrain all manipulated functions to belong to $V_p$
rather than to $L^2(\mu)$. This constraint is implemented thanks to an
orthogonal projection on $V_p$. More precisely, let us denote $\Pi_p$
the orthogonal projection operator on $V_p$. Given an arbitrary input
function $g$, the output of a functional neuron constructed thanks to
$V_p$ is given by

$$T\left(\beta_0 + \int \Pi_p(w)\, \Pi_p(g)\, d\mu\right). \qquad (6)$$

The main advantage of using $V_p$ is that it can be obtained as the
vector space spanned by "computer friendly" functions, that is,
functions that are easy to evaluate on a computer. One possibility
consists in using a Hilbert basis of $L^2(\mu)$, that is a complete
orthonormal system $(\phi_k)_{k \in \mathbb{N}^*}$. Useful examples
include wavelets and trigonometric functions. Then $V_p$ is defined as
the vector space spanned by $(\phi_k)_{1 \le k \le p}$.

Another possibility consists in using spline spaces, that is vector
spaces of piecewise polynomial functions, or more generally, specific
$V_p$ that have been chosen because $\Pi_p$ is easy to calculate and
functions in $V_p$ are easy to manipulate.

From a theoretical point of view and in the general case, $V_p$ is given
by an orthonormal basis $(\phi_{p,k})_{1 \le k \le p}$. This basis
allows us to identify $V_p$ with $\mathbb{R}^p$. We denote $\pi_p$ the
coordinate map, that is the function from $L^2(\mu)$ to $\mathbb{R}^p$
that maps $g$ to the coordinates of $\Pi_p(g)$ on the basis
$(\phi_{p,k})_{1 \le k \le p}$, i.e., to a vector in $\mathbb{R}^p$ such
that $\Pi_p(g) = \sum_{k=1}^{p} \pi_p(g)_k \, \phi_{p,k}$. We have:

$$\int \Pi_p(w)\, \Pi_p(g)\, d\mu = \sum_{k=1}^{p} \pi_p(w)_k \, \pi_p(g)_k. \qquad (7)$$

This shows, as explained in [43], that the projection approach
corresponds to a pre-processing step that transforms functional inputs
into finite dimensional inputs. A simple way to implement a projection
based functional MLP consists in using a standard MLP to which the
p coordinates of the projected functions are submitted (the MLP
therefore uses standard vector inputs in $\mathbb{R}^p$). The resulting
model gives exactly the same output as a functional MLP built for
functional inputs in $V_p$.
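
The pre-processing view can be illustrated with a small numerical check
of equation (7). The sketch below is an assumption-laden example, not
the implementation used in [43]: it takes $\mu$ to be the Lebesgue
measure on [0, 1], uses the cosine basis (which is orthonormal in that
space) as the projection basis, and approximates all integrals with a
midpoint rule. The helper names are hypothetical.

```python
import math

def phi(k, x):
    """k-th function (k >= 1) of the cosine basis, orthonormal in L2([0, 1])."""
    return 1.0 if k == 1 else math.sqrt(2.0) * math.cos((k - 1) * math.pi * x)

def integrate(f, n_steps=2000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / n_steps
    return h * sum(f((j + 0.5) * h) for j in range(n_steps))

def coordinates(g, p):
    """pi_p(g): coordinates of the projection of g on span(phi_1..phi_p)."""
    return [integrate(lambda x, k=k: phi(k, x) * g(x)) for k in range(1, p + 1)]

def projection(coords):
    """Pi_p(g) as a callable, rebuilt from its coordinates."""
    return lambda x: sum(c * phi(k, x) for k, c in enumerate(coords, start=1))

p = 5
g = lambda x: x * (1.0 - x)      # hypothetical input function
w = lambda x: math.exp(-x)       # hypothetical weight function
cg, cw = coordinates(g, p), coordinates(w, p)

# Left- and right-hand sides of equation (7).
lhs = integrate(lambda x: projection(cw)(x) * projection(cg)(x))
rhs = sum(a * b for a, b in zip(cw, cg))
print(lhs, rhs)
```

Once `coordinates` has been applied to every input function, the vectors
it returns can be fed to any standard MLP implementation, which is the
practical content of the paragraph above.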



3. Universal approximation 

3.1. Definition 

This section is dedicated to the approximation capabilities of the func- 
tional MLP described in the previous section. We first recall a definition 
of universal approximation. 

If $A$ and $B$ are two topological spaces, we denote $C(A, B)$ the set
of continuous functions from $A$ to $B$.



fmlp-projectioii-npl-preprint.tex; 1/02/2008; 21:02; p. 6 



Definition 1. Let $X$ be a topological space and $B$ be a set of
continuous functions from $X$ to $\mathbb{R}$. We say that $B$ has the
universal approximation property for $X$ if for any compact subset $K$
of $X$, $B$ is dense in $C(K, \mathbb{R})$ for the uniform norm.

In other words, if $B$ has the universal approximation property for $X$,
for any compact subset $K$, any continuous function $f$ from $K$ to
$\mathbb{R}$, and any requested precision $\epsilon > 0$, there is
$g \in B$ such that
$\sup_{x \in K} |f(x) - g(x)| = \|f - g\|_\infty < \epsilon$.

3.2. Projection and universal approximation 

When functions are processed thanks to a projection, approximation
capabilities depend both on the neural model and on the projection.
Universal approximation cannot be reached if MLPs are constrained to
work on a fixed subspace $V_p$. Indeed, most of the functions in
$L^2(\mu)$ are very poorly approximated by their projections on $V_p$
for a fixed set of functions $(\phi_k)_{1 \le k \le p}$. Therefore, the
neural models do not have enough information on their actual inputs to
provide meaningful outputs. To solve this problem, we need to consider
more and more precise projections.

Definition 2. Let us consider a sequence of functions from $L^2(\mu)$,
$(\phi_{p,k})_{p \in \mathbb{N}^*, 1 \le k \le p}$, such that for each
$p$, $(\phi_{p,k})_{1 \le k \le p}$ is an orthonormal system. We denote
$V_p$ the subspace of $L^2(\mu)$ spanned by
$(\phi_{p,k})_{1 \le k \le p}$ and $\Pi_p$ the orthogonal projection
operator on $V_p$.

Let $Q$ be a subset of $L^2(\mu)$. The sequence
$(\Pi_p)_{p \in \mathbb{N}^*}$ (and the corresponding sequence of
functions) is said to have the point-wise approximation property for $Q$
if $\Pi_p$ converges to $\mathrm{Id}_Q$ on $Q$ for the point-wise
convergence: for all $g \in Q$,
$\lim_{p \to \infty} \|\Pi_p(g) - g\|_2 = 0$.

A simple example of sequence with the point-wise approximation property
for $L^2(\mu)$ is given by any Hilbert basis
$(\phi_k)_{k \in \mathbb{N}^*}$ of this space. Indeed, any function $g$
in $L^2(\mu)$ has a series expansion
$g = \sum_{k=1}^{\infty} g_k \phi_k$. Therefore, the sequence defined by
$\phi_{p,k} = \phi_k$ has obviously the point-wise approximation
property.
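
For a Hilbert basis, the point-wise approximation property can be
observed numerically: the distance between a function and its projection
shrinks as $p$ grows. The following sketch assumes the Lebesgue measure
on [0, 1], the cosine basis, a midpoint quadrature and an arbitrary test
function; all of these choices, and the helper names, are illustrative
assumptions.

```python
import math

def phi(k, x):
    """k-th function (k >= 1) of the cosine basis, orthonormal in L2([0, 1])."""
    return 1.0 if k == 1 else math.sqrt(2.0) * math.cos((k - 1) * math.pi * x)

def integrate(f, n_steps=2000):
    """Midpoint rule on [0, 1]."""
    h = 1.0 / n_steps
    return h * sum(f((j + 0.5) * h) for j in range(n_steps))

def projection_error(g, p):
    """|| Pi_p(g) - g ||_2, computed as sqrt(||g||^2 - sum_k pi_p(g)_k^2).

    The identity holds because Pi_p is an orthogonal projection on a space
    spanned by an orthonormal system.
    """
    norm_sq = integrate(lambda x: g(x) ** 2)
    coords_sq = sum(integrate(lambda x, k=k: phi(k, x) * g(x)) ** 2
                    for k in range(1, p + 1))
    return math.sqrt(max(norm_sq - coords_sq, 0.0))

g = lambda x: abs(x - 0.3)       # hypothetical, non smooth test function
errors = [projection_error(g, p) for p in (1, 2, 4, 8, 16, 32)]
for p, err in zip((1, 2, 4, 8, 16, 32), errors):
    print(p, err)
```

The printed errors are non increasing in $p$, since each additional
basis function can only remove part of the residual.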

Thanks to those increasingly accurate projections, we can construct a
set of MLP based functions with the universal approximation property for
$L^2(\mu)$.

Theorem 1. Let $T$ be a continuous non polynomial function from
$\mathbb{R}$ to $\mathbb{R}$ and let
$(\phi_{p,k})_{p \in \mathbb{N}^*, 1 \le k \le p}$ be a sequence of
functions from $L^2(\mu)$ with the point-wise approximation property for
$L^2(\mu)$. Let us denote
$S(T, (\phi_{p,k})_{p \in \mathbb{N}^*, 1 \le k \le p})$ the set of
functions from $L^2(\mu)$ to $\mathbb{R}$ of the form

$$g \mapsto \sum_{l=1}^{L} a_l \, T\left(\beta_{l0} + \sum_{k=1}^{p} \beta_{lk} \, \pi_p(g)_k\right),$$

where $L \in \mathbb{N}^*$, $p \in \mathbb{N}^*$,
$\beta_{lk} \in \mathbb{R}$ and $a_l \in \mathbb{R}$ ($\pi_p$ is the
coordinate map defined in section 2.3).

Then $S(T, (\phi_{p,k})_{p \in \mathbb{N}^*, 1 \le k \le p})$ has the
universal approximation property for $L^2(\mu)$.



3.3. Relation to previous works 

A lot of work has been done on the topic of universal approximation 
properties of multilayer perceptrons (see, e.g., [47, 34] for reviews). 
For functional inputs, pioneering work can be found in [10]. This
paper proves that single hidden layer perceptrons with functional inputs
have the universal approximation property for $C([a,b], \mathbb{R})$ and
$L^p([a,b])$. Those results are based either on the exact calculation of
some specific integrals (for $L^p([a,b])$) or on a vector representation
of functions based on an evaluation map (for $C([a,b], \mathbb{R})$):
$g$ is replaced by $(g(x_1), \ldots, g(x_n))$. Those results have been
improved in more recent papers [12, 9].

Other pioneering work can be found in [44]: this paper shows that 
some specific feed-forward architecture with functional inputs has the 
universal approximation property. This result relies on perfect calcula- 
tion of inner products. Generalizations of this result can be found in 
[46, 47]. 

Finally, [11] studies a projection based approach for Radial Basis 
Function Network and [45] studies the approximate realization of the 
model proposed in [44] thanks to projection. Both works are related 
to the model proposed in the present paper. The novelty of our ap- 
proach consists in allowing complex projection methods whereas [11, 45] 
are limited to truncated basis representation. The complex projec- 
tion methods covered by Theorem 1, especially those based on spline 
approximations, have been used successfully in [43] for real world data. 



4. Consistency 

4.1. Introduction 

While universal approximation is an important property, it is not suffi- 
cient to ensure that the considered model can be used with success for 
some machine learning task. Another problem must be assessed: is it
possible to design, from a finite set of examples, a functional MLP such
that when the number of examples goes to infinity, the FMLP provides
a more and more accurate approximation of the underlying relationship
between the input functions and the numerical outputs? This question
(the "learnability") has been studied in detail in the case of numerical
MLPs, see e.g. [48, 2, 31].

To give a precise mathematical translation of this question, we
introduce the following notations (we follow [31]). We denote $(G, Y)$ a
pair of random variables, defined on a probability space
$(\Omega, \mathcal{A}, P)$, that take their values from $L^2(\mu)$ and
$\mathbb{R}$, respectively. Our goal is to predict the value of $Y$
given $G$. To assess the quality of this prediction, we need an error
measure. In this paper, we use the root mean square error, but any
$L^p$-error could be used^2. Given a function $h$ from $L^2(\mu)$ to
$\mathbb{R}$, the root mean square prediction error is defined as

$$C(h) = E\left[(h(G) - Y)^2\right]^{1/2}, \qquad (8)$$

where $E[\cdot]$ denotes the expectation. If we assume that
$E[|Y|^2] < \infty$, then $C$ is minimized by the conditional
expectation of $Y$ given $G$, i.e., by $h(g) = E[Y | G = g]$. We denote
$C^*$ the minimal root mean square error, i.e.

$$C^* = \inf_h C(h) = E\left[(E[Y|G] - Y)^2\right]^{1/2}. \qquad (9)$$

We have no information about the distribution of $(G, Y)$, except for n
independent, identically distributed (i.i.d.) copies of $(G, Y)$,

$$D_n = \left((G^1, Y^1), \ldots, (G^n, Y^n)\right).$$

Using this data set, we can build a prediction model $h_n$ (from
$L^2(\mu)$ to $\mathbb{R}$). The model depends on $D_n$ and its
performances are given by the following random variable

$$C(h_n) = E\left[(h_n(G) - Y)^2 \mid D_n\right]^{1/2}. \qquad (10)$$

A sequence of prediction models $(h_n)_{n \in \mathbb{N}^*}$ is
universally consistent (see [31]) if $C(h_n)$ converges almost surely to
$C^*$, for any distribution of $(G, Y)$ satisfying $E[|Y|^2] < \infty$.
The intuitive interpretation of this condition is that given enough data
(when n goes to infinity), the root mean square error of $h_n$ will be
arbitrarily close to the best possible root mean square error: we are
indeed learning the relationship between $Y$ and $G$ from examples.
Another way to look at the condition is to rewrite it into the following
equivalent condition:

$$E\left[(h_n(G) - E[Y|G])^2 \mid D_n\right] \xrightarrow[n \to \infty]{} 0 \quad \text{a.s.} \qquad (11)$$

This condition means that $h_n(G)$ is arbitrarily close to $E[Y|G]$ for
the mean square error.

^2 There is no relation between the functional input space $L^2(\mu)$
and the use of the mean square error.
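
The equivalence between the two formulations of consistency follows
from a standard orthogonality argument, sketched here for convenience
(it is a textbook decomposition, not a quotation from [31]). It assumes,
as is implicit above, that $(G, Y)$ is independent of $D_n$, so that the
cross term vanishes after conditioning on $G$:

```latex
\begin{align*}
C(h_n)^2
  &= E\left[\big(h_n(G) - E[Y|G] + E[Y|G] - Y\big)^2 \,\middle|\, D_n\right] \\
  &= E\left[\big(h_n(G) - E[Y|G]\big)^2 \,\middle|\, D_n\right]
     + E\left[\big(E[Y|G] - Y\big)^2\right] \\
  &= E\left[\big(h_n(G) - E[Y|G]\big)^2 \,\middle|\, D_n\right] + (C^*)^2 .
\end{align*}
```

Since $(C^*)^2$ is finite and does not depend on n, $C(h_n)$ converges
to $C^*$ exactly when the first term of the last line converges to 0.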

4.2. Projection and consistency 

In this section, we restrict the projection approach to the simple case
of sequences of projection spaces constructed thanks to a Hilbert basis
of the functional space. More precisely, we assume given
$(\psi_p)_{p \in \mathbb{N}^*}$ a Hilbert basis of $L^2(\mu)$. We denote
$V_p$ the sub-vector space spanned by $(\psi_k)_{1 \le k \le p}$ and
define $\pi_p$ as in section 2.3. In order to build a consistent
learning method based on projection on $V_p$ spaces, we need to adapt
the expressive power of the candidate neural networks to the size of
the learning set (i.e., to n). Rather than choosing an arbitrary single
hidden layer perceptron, we restrict the search to some classes of such
perceptrons. More precisely, given $(L_n)_{n \in \mathbb{N}^*}$ a
sequence of integers and $(a_n)_{n \in \mathbb{N}^*}$ a sequence of
positive real values, we define $\mathcal{H}_{np}$, a sequence of single
hidden layer functional perceptron classes, by:



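$$\mathcal{H}_{np} = \left\{ h \in C(L^2(\mu), \mathbb{R}) \;\middle|\; h(g) = \sum_{l=1}^{L_n} a_l \, T\left(\beta_{l0} + \sum_{k=1}^{p} \beta_{lk} \, \pi_p(g)_k\right),\ \sum_{l=1}^{L_n} |a_l| \le a_n \right\} \qquad (12)$$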

In those classes, $L_n$ and $a_n$ provide a type of regularization by
adapting the number of hidden neurons (thanks to $L_n$) and the
magnitude of the weights of the output layer (thanks to $a_n$) to the
size of the learning set. A consistent sequence of models will be
obtained by choosing the best single hidden layer perceptron in
$\mathcal{H}_{np}$, according to the empirical error (see Theorem 2).
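
To make the role of the two constraints concrete, the sketch below
evaluates one member of such a class on pre-computed coordinates
$\pi_p(g)$ and checks the constraint on the output weights. It assumes a
logistic activation, which takes values in [0, 1]; under that assumption
the output of any member of the class is bounded in absolute value by
$a_n$. The function names and numerical values are hypothetical.

```python
import math

def logistic(t):
    """Assumed activation T, with values in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-t))

def in_class(output_weights, L_n, a_n):
    """Check the class constraints: L_n hidden neurons, sum_l |a_l| <= a_n."""
    return (len(output_weights) == L_n
            and sum(abs(a) for a in output_weights) <= a_n)

def evaluate(coords, output_weights, hidden_weights, activation=logistic):
    """h(g) computed from coords = pi_p(g).

    hidden_weights[l] is a pair (beta_l0, [beta_l1, ..., beta_lp]).
    """
    total = 0.0
    for a, (b0, betas) in zip(output_weights, hidden_weights):
        total += a * activation(b0 + sum(b * c for b, c in zip(betas, coords)))
    return total

L_n, a_n = 3, 2.5                        # hypothetical values
output_weights = [1.2, -0.5, 0.3]        # sum of absolute values is 2.0
hidden_weights = [(0.0, [1.0, -1.0]), (0.5, [0.2, 0.0]), (-1.0, [3.0, 2.0])]
coords = [0.4, -0.7]                     # stands for pi_p(g) with p = 2
value = evaluate(coords, output_weights, hidden_weights)
print(in_class(output_weights, L_n, a_n), value)
```

The bound on the outputs is what allows $a_n$ to act as a
regularization parameter: letting it grow slowly with n enlarges the
class in a controlled way.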

To obtain consistency, we need some technical hypotheses: 

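(H-1) $T$ is a function from $\mathbb{R}$ to $[0, 1]$, monotone non
decreasing, with $\lim_{x \to \infty} T(x) = 1$ and
$\lim_{x \to -\infty} T(x) = 0$;

(H-2) $(L_n)_{n \in \mathbb{N}^*}$ and $(a_n)_{n \in \mathbb{N}^*}$ are
such that

$$\lim_{n \to \infty} L_n = \infty, \qquad \lim_{n \to \infty} a_n = \infty;$$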




(H-3) $(L_n)_{n \in \mathbb{N}^*}$ and $(a_n)_{n \in \mathbb{N}^*}$ are such that

L„a^ log(L„a„) 
hm = 0, 

n— >oo fi 

and such that there is (5 > such that 

hm 4^ = 0. 

n— >oo n 

Hypothesis (H-1) corresponds to a standard requirement for activation
functions of multilayer perceptrons. It is fulfilled for instance by
$T(x) = 1/(1 + e^{-x})$.

Hypothesis (H-2) ensures that the expressive power of the considered
classes is not limited asymptotically, as the regularization vanishes
asymptotically.

Hypothesis (H-3) corresponds to the regularization. The constraints
come from [31] (stronger constraints were used in [48]). They control
the way the expressive power of $\mathcal{H}_{np}$ grows with n.

Some possible choices for $L_n$ and $a_n$ include
$L_n = \lceil \log n \rceil$ (where $\lceil x \rceil$ denotes the
smallest integer greater than or equal to $x$) and a„ = n& .

Under those hypotheses, we have the following consistency result. 

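Theorem 2. Let $h_{np}$ be a function that minimizes the empirical mean
square error in $\mathcal{H}_{np}$, i.e. such that

$$\sum_{i=1}^{n} \left(h_{np}(G^i) - Y^i\right)^2 \le \sum_{i=1}^{n} \left(h(G^i) - Y^i\right)^2$$

for all $h \in \mathcal{H}_{np}$.

Under hypotheses (H-1), (H-2) and (H-3), we have

$$\lim_{p \to \infty} \lim_{n \to \infty} C(h_{np}) = C^* \quad \text{[a.s.]},$$

for all distributions of $(G, Y)$ such that $E[|Y|^2] < \infty$.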

The theorem means that a functional MLP $h_{np}$ that minimizes its mean
square error on a training set with n examples provides a more and
more accurate approximation of $E[Y|G]$ when n goes to infinity. The
theorem provides some rules on $L_n$ and $a_n$ that avoid over-fitting
and guarantee good generalization. The only limitation of this result
comes from the sequential limit: the theorem does not provide
guidelines to link $p$ to $n$.

It should be noted that the theorem could be adapted to any model 
that is universally consistent in finite dimension. 

5. Conclusion

We have demonstrated in this paper two important results for projection
based functional multilayer perceptrons: they have the universal
approximation property and they can learn arbitrary mappings. Thanks
to the representation of the studied functions through projection, we
adapt the strong results available for standard numerical MLPs to
functional MLPs. This gives a satisfactory theoretical backing to the
method proposed in [43].

However, some questions remain open, especially if we want to fully
justify the method illustrated in [43]. The most important point is the
choice of the projection quality, i.e., of $V_p$. In [43], it was
determined thanks to the input data alone. The goal was to limit the
distortion between $\Pi_p(g)$ and $g$. Further theoretical investigation
of this method is needed.

Another possibility is to use a split sample or a re-sampling technique
to choose an optimal $V_p$, as we did in [41]. In practice, this
introduces a huge computational load, without much practical gain.
However, models constructed like this are universally consistent in the
case of classification [5].

In practice, some very good results have been obtained in [20] thanks
to an automatic construction of $V_p$ based on a functional version of
the Slice Inverse Regression. While the authors provide important
theoretical results, the consistency of this method remains an open
question.

Finally, the second most important open question is related to the
very nature of functional data: in practice, functional data are always
given as finite sets of (input, output) pairs. As a consequence,
projected functions cannot be exactly computed and are replaced by
approximations (see [43] for details). The effects of this approximation
on the capabilities of functional MLPs have been partially studied in
[13] but the consistency of models constructed thanks to approximate
values is not yet established.



6. Proofs 

6.1. Theorem 1 

This theorem is based on results given in [47] for MLPs with arbitrary 
inputs. 

Let us consider a compact subset $K$ of $L^2(\mu)$. We want to
approximate functions in $C(K, \mathbb{R})$ by functions in
$S(T, (\phi_{p,k})_{p \in \mathbb{N}^*, 1 \le k \le p})$.

6.1.1. Step one

As a first step, we prove that the sequence of operators
$(\Pi_p)_{p \in \mathbb{N}^*}$ converges to $\mathrm{Id}_K$ uniformly on
$K$, i.e. for $\eta > 0$, there is $P$ such that for each $p \ge P$ and
for each $g \in K$, $\|\Pi_p(g) - g\|_2 < \eta$ ($P$ does not depend on
$g$).

Let us consider $g_0 \in K$ and $K(g_0, r) = B(g_0, r) \cap K$ a
neighborhood of $g_0$ in $K$, where $B(g_0, r)$ denotes the open ball of
radius $r$ centered on $g_0$. As $(\Pi_p)_{p \in \mathbb{N}^*}$ has the
point-wise approximation property, there is $P_0$ such that for each
$p_0 \ge P_0$, $\|\Pi_{p_0}(g_0) - g_0\|_2 < \eta/2$. For each
$g \in K(g_0, r)$, we have
$\|\Pi_{p_0}(g) - g\|_2 \le \|\Pi_{p_0}(g) - \Pi_{p_0}(g_0)\|_2 + \|\Pi_{p_0}(g_0) - g_0\|_2 + \|g_0 - g\|_2$.
As requested above, the middle term is smaller than $\eta/2$. As
$\Pi_{p_0}$ is Lipschitz continuous (the Lipschitz constant is 1),
$\|\Pi_{p_0}(g) - \Pi_{p_0}(g_0)\|_2 \le \|g - g_0\|_2$. Therefore,
$\|\Pi_{p_0}(g) - g\|_2 < \eta/2 + 2\|g - g_0\|_2$. As a consequence,
for all $g \in K(g_0, \eta/4)$, $\|\Pi_{p_0}(g) - g\|_2 < \eta$.
As $K$ is compact, it is covered by a finite number of balls
$B(g_i, \eta/4)$ (and therefore of $K(g_i, \eta/4)$). We consider
$P = \max_i P_i$, which allows us to conclude.

6.1.2. Step two 

Let us now denote $S(T, L^2(\mu))$ the set of functions from $L^2(\mu)$
to $\mathbb{R}$ of the form

$$g \mapsto \sum_{l=1}^{L} a_l \, T\left(\beta_{l0} + \langle w_l, g \rangle\right), \qquad (13)$$

where $L \in \mathbb{N}^*$, $a_l \in \mathbb{R}$,
$\beta_{l0} \in \mathbb{R}$ and $w_l \in L^2(\mu)$. Then,
$S(T, L^2(\mu))$ has the universal approximation property for
$L^2(\mu)$.

Indeed, Corollary 5.1.2 of [47] can be applied, as its conditions are
fulfilled:

- $L^2(\mu)$ is locally convex and is isometric to its topological dual;

- as $T$ is continuous and non-polynomial, single hidden layer
  perceptrons using $T$ as their activation function have the universal
  approximation property for $\mathbb{R}$ (see [34] for instance).



6.1.3. Step three 

Let us now consider a continuous function $F$ from $K$ to $\mathbb{R}$
and let $\epsilon > 0$ be an arbitrary precision.

According to step two, there is $H \in S(T, L^2(\mu))$, given by
equation 13, such that for all $g \in K$,
$|H(g) - F(g)| < \frac{\epsilon}{2}$.

As $H$ is continuous on $L^2(\mu)$, for each $g \in K$ there is
$\eta(g) > 0$ such that for each $f \in B(g, \eta(g))$, we have
$|H(g) - H(f)| < \frac{\epsilon}{4}$. As $K$ is compact, it is covered
by a finite number of the balls
$\left(B\left(g_i, \frac{\eta(g_i)}{2}\right)\right)_{1 \le i \le N}$.
We denote $\eta = \min_{1 \le i \le N} \eta(g_i)$.

According to Step one of the proof, there is $p$ such that for all
$g \in K$, $\|\Pi_p(g) - g\|_2 < \frac{\eta}{2}$. There is $i$ such that
$g$ falls in $B\left(g_i, \frac{\eta(g_i)}{2}\right)$ and we have

$$\|\Pi_p(g) - g_i\|_2 \le \|\Pi_p(g) - g\|_2 + \|g - g_i\|_2 < \eta(g_i),$$

which implies

$$|H(\Pi_p(g)) - H(g)| \le |H(\Pi_p(g)) - H(g_i)| + |H(g_i) - H(g)| < \frac{\epsilon}{2},$$

by using twice the continuity of $H$ at $g_i$. We have therefore

$$|H(\Pi_p(g)) - F(g)| < \epsilon.$$

To conclude, we note that
$\langle w_l, \Pi_p(g) \rangle = \langle \Pi_p(w_l), \Pi_p(g) \rangle = \sum_{k=1}^{p} \pi_p(w_l)_k \, \pi_p(g)_k$,
which means that $H \circ \Pi_p$ belongs to
$S(T, (\phi_{p,k})_{p \in \mathbb{N}^*, 1 \le k \le p})$.

6.2. Theorem 2 

Theorem 2 is based on theorem 3 from [31]. The latter applies to
standard MLPs with inputs in $\mathbb{R}^p$ and provides universal
consistency.

6.2.1. Step one 

To use theorem 3 from [31], we need to introduce additional notations.
We denote $G_p = \pi_p(G)$. As $\pi_p$ is continuous, $G_p$ is a random
variable that takes values from $\mathbb{R}^p$. We denote
$G_p^i = \pi_p(G^i)$. Obviously, for any $p$,
$D_n^p = \left((G_p^1, Y^1), \ldots, (G_p^n, Y^n)\right)$ consists in n
i.i.d. copies of $(G_p, Y)$. If $f_n$ is a measurable function from
$\mathbb{R}^p$ to $\mathbb{R}$ constructed thanks to $D_n^p$, we denote

$$C_p(f_n) = E\left[(f_n(G_p) - Y)^2 \mid D_n^p\right]^{1/2}.$$

We denote

$$C_p^* = \inf_f E\left[(f(G_p) - Y)^2\right]^{1/2},$$

where the infimum is taken over all measurable functions from
$\mathbb{R}^p$ to $\mathbb{R}$.

As $E[|Y|^2] < \infty$, $C_p^*$ is reached for $f$ defined by
$f(g) = E[Y | G_p = g]$.

Each function $h$ in $\mathcal{H}_{np}$ can be written
$h = f \circ \pi_p$, where $f$ is chosen in $\mathcal{F}_{np}$ defined
by

$$\mathcal{F}_{np} = \left\{ f \in C(\mathbb{R}^p, \mathbb{R}) \;\middle|\; f(x) = \sum_{l=1}^{L_n} a_l \, T\left(\beta_{l0} + \sum_{k=1}^{p} \beta_{lk} x_k\right),\ \sum_{l=1}^{L_n} |a_l| \le a_n \right\}.$$

Moreover, a function $f_{np} \in \mathcal{F}_{np}$ such that
$h_{np} = f_{np} \circ \pi_p$ has obviously the smallest empirical error
among functions in $\mathcal{F}_{np}$, that is

$$\sum_{i=1}^{n} \left(f_{np}(G_p^i) - Y^i\right)^2 \le \sum_{i=1}^{n} \left(f(G_p^i) - Y^i\right)^2$$

for all $f \in \mathcal{F}_{np}$. Then, according to theorem 2 from [31]
and thanks to hypothesis on $L_n$ and $a_n$, for any fixed $p$,
$\lim_{n \to \infty} C_p(f_{np}) = C_p^*$. In other words,
$\lim_{n \to \infty} C(h_{np}) = C_p^*$ (almost surely).



6.2.2. Step two 

We show now that $\lim_{p \to \infty} C_p^* = C^*$. Let us consider the
sequence of random variables $X_p = E[Y | G_p]$ and the sequence of
$\sigma$-fields $\mathcal{A}_p = \sigma(G_p)$. We first show that
$(\mathcal{A}_p)_{p \in \mathbb{N}^*}$ is a filtration, i.e., that
$\mathcal{A}_p \subset \mathcal{A}_{p+1}$. This is a simple consequence
of the definition of $G_p$. Indeed, $G_p = \pi_p(G)$ and therefore
$G_p = \nu_p(G_{p+1})$ where $\nu_p$ is the function from
$\mathbb{R}^{p+1}$ to $\mathbb{R}^p$ defined by

$$\nu_p(x_1, \ldots, x_p, x_{p+1}) = (x_1, \ldots, x_p).$$

As $G_p$ is the composition of a continuous function and of $G_{p+1}$,
the $\sigma$-field generated by $G_p$ is a subset of the $\sigma$-field
generated by $G_{p+1}$.

As $E[|Y|^2] < \infty$, $E[|Y|] < \infty$. This allows us to apply
Lemma 35 from [35] (page 154), from which we conclude that
$(X_p)_{p \in \mathbb{N}^*}$ is a uniformly integrable martingale for
the $\mathcal{A}_p$ filtration. Therefore, according to Theorem 36 from
[35] (page 154), $(X_p)_{p \in \mathbb{N}^*}$ converges almost surely to
an integrable random variable $X_\infty$. Moreover, as
$X_p = E[Y | \mathcal{A}_p]$, according to the same theorem,
$X_\infty = E\left[Y \,\middle|\, \sigma\left(\bigcup_{p \in \mathbb{N}^*} \mathcal{A}_p\right)\right]$.
Obviously, we have
$\sigma\left(\bigcup_{p \in \mathbb{N}^*} \mathcal{A}_p\right) = \sigma(G)$
and therefore $(X_p)_{p \in \mathbb{N}^*}$ converges almost surely to
$E[Y | G]$.

Finally, as $E[|Y|^2] < \infty$, $E[|X_p|^2] \le E[|Y|^2] < \infty$ and
therefore, the convergence also happens for the quadratic norm (see
Corollary 6.22 from [29]), i.e.

$$\lim_{p \to \infty} E\left[(E[Y | G_p] - E[Y | G])^2\right] = 0.$$

This clearly implies $\lim_{p \to \infty} C_p^* = C^*$ (almost surely).
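
The final implication can be made explicit with a standard
decomposition, given here as a convenience and not as part of the
original argument. Since $E[Y | G_p] - E[Y | G]$ is measurable with
respect to $\sigma(G)$ and $E[Y | G] - Y$ has zero conditional mean
given $G$, the cross term vanishes and

```latex
\begin{align*}
(C_p^*)^2
  &= E\left[\big(E[Y | G_p] - Y\big)^2\right] \\
  &= E\left[\big(E[Y | G_p] - E[Y | G]\big)^2\right]
     + E\left[\big(E[Y | G] - Y\big)^2\right] \\
  &= E\left[\big(E[Y | G_p] - E[Y | G]\big)^2\right] + (C^*)^2 .
\end{align*}
```

The first term of the last line converges to 0 by the quadratic
convergence established above, hence $C_p^*$ converges to $C^*$.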